CRUXEval-output: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy.
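
Concretely, this is a two-sided sign test over the non-tied examples. A minimal sketch in Python, with hypothetical win counts (the leaderboard computes these from per-example results):

    from scipy.stats import binomtest

    # Hypothetical counts over the non-tied examples for one model pair.
    a_wins = 230   # examples model A gets right and model B gets wrong
    b_wins = 180   # examples model B gets right and model A gets wrong

    # Under the null, each non-tied example is a fair coin flip, so the
    # two-sided p-value is the probability of a split at least this lopsided.
    result = binomtest(a_wins, n=a_wins + b_wins, p=0.5)
    print(f"p-value = {result.pvalue:.4f}")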

p-values vs. differences

The range of possible p-values vs. the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
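
For intuition, the normal approximation puts the 5% boundary near |#A_win - #B_win| ≈ 2·sqrt(#A_win + #B_win), which traces a parabola in these coordinates. A minimal sketch that finds the exact boundary using the same sign test, assuming alpha = 0.05:

    from scipy.stats import binomtest

    alpha = 0.05
    for n in [50, 100, 200, 400, 800]:        # n = #A_win + #B_win
        # Smallest |#A_win - #B_win| that is significant at level alpha.
        for diff in range(n % 2, n + 1, 2):   # diff must share n's parity
            k = (n + diff) // 2               # implied #A_win
            if binomtest(k, n=n, p=0.5).pvalue < alpha:
                print(f"n = {n:3d}: need |#A_win - #B_win| >= {diff}")
                break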

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as used by most benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
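
A minimal sketch of the latter two scores, assuming a hypothetical win-count matrix wins[i, j] = number of examples model i gets right and model j gets wrong; the Bradley-Terry fit below uses the standard MM iteration rather than the leaderboard's exact code, and anchors GPT-3.5 at 1000 as described:

    import numpy as np

    models = ["gpt-3.5-turbo-0613", "model-x", "model-y"]  # hypothetical
    wins = np.array([[ 0., 60., 75.],
                     [40.,  0., 55.],
                     [25., 45.,  0.]])
    n = len(models)
    games = wins + wins.T                  # non-tied examples per pair

    # Average win rate: mean over opponents of wins / non-tied examples.
    rates = np.divide(wins, games, out=np.zeros_like(wins), where=games > 0)
    avg_win_rate = rates.sum(axis=1) / (n - 1)

    # Bradley-Terry strengths p with P(i beats j) = p_i / (p_i + p_j),
    # fit by the standard minorize-maximize (MM) iteration.
    p = np.ones(n)
    for _ in range(200):
        p = wins.sum(axis=1) / (games / (p[:, None] + p[None, :])).sum(axis=1)
        p /= p.sum()                       # fix the arbitrary scale

    # Convert to the Elo scale and anchor GPT-3.5 at 1000, as in the table.
    elo = 400 * np.log10(p)
    elo += 1000 - elo[models.index("gpt-3.5-turbo-0613")]

    for name, w, e in zip(models, avg_win_rate, elo):
        print(f"{name:20s}  win_rate={w:6.1%}  elo={e:7.1f}")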

model                        pass@1    std  win_rate     elo
gpt-4-turbo-2024-04-09+cot    82.0%  0.57%     96.1%  1508.0
claude-3-opus-20240229+cot    82.0%  0.00%     95.6%  1489.5
gpt-4-0613+cot                77.1%  0.66%     93.0%  1392.2
gpt-4-0613                    68.7%  0.32%     87.4%  1283.2
gpt-4-turbo-2024-04-09        67.7%  0.30%     86.8%  1267.2
claude-3-opus-20240229        65.8%  0.00%     85.0%  1247.0
gpt-3.5-turbo-0613+cot        59.0%  0.89%     73.6%  1116.3
deepseek-instruct-33b         49.9%  0.47%     61.8%  1024.4
gpt-3.5-turbo-0613            49.4%  0.40%     59.3%  1000.0
deepseek-base-33b             48.6%  0.52%     58.9%  1006.5
codetulu-2-34b                45.8%  0.50%     51.9%   954.3
magicoder-ds-7b               44.4%  0.51%     48.6%   931.3
codellama-34b+cot             43.6%  1.02%     44.0%   884.8
deepseek-base-6.7b            43.5%  0.57%     46.9%   919.3
wizard-34b                    43.4%  0.44%     47.8%   920.2
codellama-34b                 42.4%  0.56%     43.8%   889.6
codellama-python-34b          41.4%  0.48%     41.7%   873.2
wizard-13b                    41.3%  0.50%     43.1%   887.3
deepseek-instruct-6.7b        41.2%  0.41%     42.8%   883.3
mixtral-8x7b                  40.5%  0.58%     38.7%   855.5
codellama-python-13b          39.8%  0.52%     39.4%   859.9
codellama-13b                 39.7%  0.56%     38.2%   857.8
phind                         39.7%  0.46%     38.4%   851.0
codellama-13b+cot             36.0%  1.07%     28.8%   765.3
codellama-python-7b           35.9%  0.54%     32.2%   804.3
mistral-7b                    34.3%  0.56%     26.9%   757.1
codellama-7b                  34.2%  0.57%     26.1%   754.6
starcoderbase-16b             34.2%  0.55%     28.9%   776.4
phi-2                         33.5%  0.55%     27.3%   756.9
starcoderbase-7b              32.2%  0.47%     23.5%   728.3
deepseek-base-1.3b            31.0%  0.57%     24.6%   739.2
codellama-7b+cot              29.9%  1.05%     17.4%   645.0
deepseek-instruct-1.3b        28.7%  0.48%     20.8%   690.9
phi-1.5                       27.5%  0.56%     20.5%   687.9
phi-1                         21.7%  0.45%     14.7%   610.2